In this chapter we discuss regression models. The basic concept is that we forecast the time series of interest y assuming that it has a linear relationship with other time series \(x\).

For example, we might wish to forecast monthly sales \(y\) using total advertising spend \(x\) as a predictor. Or we might forecast daily electricity demand y using temperature \(x_1\) and the day of week \(x_2\) as predictors.

The forecast variable \(y\) is sometimes also called the regressand, dependent or explained variable. The predictor variables \(x\) are sometimes also called the regressors, independent or explanatory variables. In this book we will always refer to them as the “forecast” variable and “predictor” variables.

7.1 The Linear Model

In the simplest case, the regression model allows for a linear relationship between the forecast variable y and a single predictor variable \(x\):

\[y_t = \beta_0 + \beta_1x_t+\epsilon_t\]

An artificial example of data from such a model is shown in Figure 7.1. The coefficients \(\beta_0\) and \(\beta_1\) denote the intercept and the slope of the line respectively. The intercept \(\beta_0\) represents the predicted value of \(y\) when \(x\)=0. The slope \(\beta_1\) represents the average predicted change in \(y\) resulting from a one unit increase in \(x\).

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.6     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.1.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(fpp3)
## ── Attaching packages ──────────────────────────────────────────── fpp3 0.4.0 ──
## ✓ lubridate   1.8.0     ✓ feasts      0.2.2
## ✓ tsibble     1.1.0     ✓ fable       0.3.1
## ✓ tsibbledata 0.3.0
## ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
## x lubridate::date()    masks base::date()
## x dplyr::filter()      masks stats::filter()
## x tsibble::intersect() masks base::intersect()
## x tsibble::interval()  masks lubridate::interval()
## x dplyr::lag()         masks stats::lag()
## x tsibble::setdiff()   masks base::setdiff()
## x tsibble::union()     masks base::union()
ggplot(data = mtcars,mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = 'lm') +
  ggtitle(label = 'Wt vs mpg for MTCars')
## `geom_smooth()` using formula 'y ~ x'

Example: US Consumption Expenditure

Figure 7.2 shows time series of quarterly percentage changes (growth rates) of real personal consumption expenditure, \(y\), and real personal disposable income, \(x\), for the US from 1970 Q1 to 2019 Q2.

us_change %>% 
  pivot_longer(c(Consumption, Income), names_to = "Series") %>% 
  autoplot(value) +
  labs(y = "% change")

A scatter plot of the consumption changes against income changes is showin in Figure 7.3, along with the estimated regression line:

\[\hat{y_t} = 0.54 + 0.27x_t\]

us_change %>% 
  ggplot(aes(x = Income, y = Consumption)) +
  labs(y = "Consumption (quarterly % change)",
       x = "Income (quarterly % change)") +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

The equation is estimated using the TSLM() (time series liner model) function:

us_change %>% 
  model(TSLM(Consumption ~ Income)) %>% 
  report()
## Series: Consumption 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.58236 -0.27777  0.01862  0.32330  1.42229 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.54454    0.05403  10.079  < 2e-16 ***
## Income       0.27183    0.04673   5.817  2.4e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5905 on 196 degrees of freedom
## Multiple R-squared: 0.1472,  Adjusted R-squared: 0.1429
## F-statistic: 33.84 on 1 and 196 DF, p-value: 2.4022e-08

Multiple Linear Regression

When there are two or more predictor variables, the model is called a multiple regression model. The general form of a multiple regression model is

\[y_t=\beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + ... + \beta_k x_{k, t} + \epsilon_t\]

where \(y\) is the variable to be forecast and \(x_1,…,x_k\) are the \(k\) predictor variables. Each of the predictor variables must be numerical. The coefficients \(\beta_1,...\beta_k\) measure the effect of each predictor after taking into account the effects of all the other predictors in the model. Thus, the coefficients measure the marginal effects of the predictor variables.

Example: US Consumption Expenditure

Figure 7.4 shows additional predictors that may be useful for forecasting US consumption expenditure. These are quarterly percentage changes in industrial production and personal savings, and quarterly changes in the unemployment rate (as this is already a percentage). Building a multiple linear regression model can potentially generate more accurate forecasts as we expect consumption expenditure to not only depend on personal income but on other predictors as well.

# Answered at: https://stackoverflow.com/questions/69606489/plotting-each-variable-in-a-time-series-on-the-same-plot/69606629#69606629
library(tidyverse)
us_change %>%
  pivot_longer(-Quarter) %>%
  ggplot(aes(Quarter, value, color = name)) +
  geom_line() +
  facet_wrap(~name, ncol = 1, scales = "free_y") +
  guides(color = "none")

Figure 7.5 is a scatterplot matrix of five variables. The first column shows the relationships between the forecast variable (consumption) and each of the predictors. The scatterplots show positive relationships with income and industrial production, and negative relationships with savings and unemployment. The strength of these relationships are shown by the correlation coefficients across the first row. The remaining scatterplots and correlation coefficients show the relationships between the predictors.

us_change %>% 
  GGally::ggpairs(columns = 2:6)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

Assumptions

When we use a linear regression model, we are implicitly making some assumptions about the variables in Equation (7.1).

First, we assume that the model is a reasonable approximation to reality; that is, the relationship between the forecast variable and the predictor variables satisfies this linear equation.

Second, we make the following assumptions about the errors \(( \epsilon_1...\epsilon_T)\)

• they have mean zero; otherwise the forecasts will be systematically biased. • they are not autocorrelated; otherwise the forecasts will be inefficient, as there is more information in the data that can be exploited. • they are unrelated to the predictor variables; otherwise there would be more information that should be included in the systematic part of the model.

It is also useful to have the errors being normally distributed with a constant variance \(\sigma^2\)

in order to easily produce prediction intervals.

Another important assumption in the linear regression model is that each predictor \(x\) is not a random variable. If we were performing a controlled experiment in a laboratory, we could control the values of each \(x\) (so they would not be random) and observe the resulting values of \(y\). With observational data (including most data in business and economics), it is not possible to control the value of \(x\), we simply observe it. Hence we make this an assumption.

7.2 Least Squares Estimation

In practice, of course, we have a collection of observations but we do not know the values of the coefficients \(\beta_0, \beta_1, ... \beta_k\). These need to be estimated from the data.

The least squares principle provides a way of choosing the coefficients effectively by minimising the sum of the squared errors. That is, we choose the values of \(\beta_0, \beta_1, ... \beta_k\) that minimise

\(\sum_{t=1}^T \epsilon_t^2 = \sum_{t=1}^T (y_t - \beta_0 - \beta_1 x_{1,t} - \beta_2 x_{2,t} - ... - \beta_k x_{k, t})^2\)

This is called least squares estimation because it gives the least value for the sum of squared errors. Finding the best estimates of the coefficients is often called “fitting” the model to the data, or sometimes “learning” or “training” the model. The line shown in Figure 7.3 was obtained in this way.

The TSLM() function fits a linear regression model to time series data. It is similar to the lm() function which is widely used for linear models, but TSLM() provides additional facilities for handling time series.

Example: US Consumption Expenditure

A multiple linear regression model for US consumption is:

\(y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \beta_3 x_{3,t} + \beta_4 x_{4,t} + \epsilon_t\)

where \(y\) is the percentage change in real personal consumption expenditure, \(x_1\) is the percentage change in real personal disposable income, \(x_2\) is the percentage change in industrial production, \(x_3\) is the percentage change in personal savings and \(x_4\) is the change in the unemployment rate.

The following output provides information about the fitted model. The first column of Coefficients gives an estimate of each \(\beta\) coefficient and the second column gives its standard error (i.e., the standard deviation which would be obtained from repeatedly estimating the \(\beta\) coefficients on similar data sets). The standard error gives a measure of the uncertainty in the estimated \(\beta\) coefficient.

fit_consMR <- us_change %>% 
  model(tslm = TSLM((Consumption ~ Income + Production + Unemployment + Savings)))

report(fit_consMR)
## Series: Consumption 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.90555 -0.15821 -0.03608  0.13618  1.15471 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.253105   0.034470   7.343 5.71e-12 ***
## Income        0.740583   0.040115  18.461  < 2e-16 ***
## Production    0.047173   0.023142   2.038   0.0429 *  
## Unemployment -0.174685   0.095511  -1.829   0.0689 .  
## Savings      -0.052890   0.002924 -18.088  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3102 on 193 degrees of freedom
## Multiple R-squared: 0.7683,  Adjusted R-squared: 0.7635
## F-statistic:   160 on 4 and 193 DF, p-value: < 2.22e-16

For forecasting purposes, the final two columns are of limited interest. The “t value” is the ratio of an estimated \(\beta\) coefficient to its standard error and the last column gives the p-value: the probability of the estimated \(\beta\) coefficient being as large as it is if there was no real relationship between consumption and the corresponding predictor. This is useful when studying the effect of each predictor, but is not particularly useful for forecasting.

Fitted Values

Predictions of y can be obtained by using the estimated coefficients in the regression equation and setting the error term to zero. In general we write:

\[\hat{y} = \hat{\beta_0} + \hat{\beta_1 x_{1,t}} +\hat{\beta_2}x_{2,t} + ... + \hat{\beta_k} x_{k,t} \] Plugging in the values of \(x_{1,t}, ... ,x_{k,t}\) for \(t\)=1,…,T returns predictions of \(y_t\) within the training set, referred to as fitted values. Note that these are predictions of the data used to estimate the model, not genuine forecasts of future values of \(y\).

The following plots show the actual values compared to the fitted values for the percentage change in the US consumption expenditure series. The time plot in Figure 7.6 shows that the fitted values follow the actual data fairly closely. This is verified by the strong positive relationship shown by the scatterplot in Figure 7.7.

augment(fit_consMR) %>% 
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Consumption, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted")) +
  labs(y = NULL,
       title = "Percentage change in US consumption expenditure") +
  scale_color_manual(values = c(Data = "black", Fitted = "#D55E00")) +
  guides(colour = guide_legend(title = "Legend"))

augment(fit_consMR) %>% 
  ggplot(aes(x = Consumption, y = .fitted)) +
  geom_point() +
  labs(
    y = "Fitted (predicted values)",
    x = "Data (actual values)",
    title = "Percent change in US consumption expendutire"
  ) +
  geom_abline(intercept = 0, slope = 1)

Goodness-of-Fit

A common way to summarise how well a linear regression model fits the data is via the coefficient of determination, or \(R^2\). This can be calculated as the square of the correlation between the observed \(y\) values and the predicted \(\hat{y}\) values. Alternatively, it can also be calculated as,

\[R^2 = \frac{\sum(\hat{y}_{t} - \bar{y})^2}{\sum(y_{t}-\bar{y})^2},\]

where the summations are over all observations. Thus, it reflects the proportion of variation in the forecast variable that is accounted for (or explained) by the regression model.

In simple linear regression, the value of \(R^2\)is also equal to the square of the correlation between \(y\) and \(X\) (provided an intercept has been included).

If the predictions are close to the actual values, we would expect \(R^2\) to be close to 1. On the other hand, if the predictions are unrelated to the actual values, then \(R^2\)=0 (again, assuming there is an intercept). In all cases, \(R^2\) lies between 0 and 1.

The \(R^2\) value is used frequently, though often incorrectly, in forecasting. The value of \(R^2\) will never decrease when adding an extra predictor to the model and this can lead to over-fitting. There are no set rules for what is a good \(R^2\) value, and typical values of \(R^2\) depend on the type of data used. Validating a model’s forecasting performance on the test data is much better than measuring the \(R^2\) value on the training data.

Figure 7.7 plots the actual consumption expenditure values versus the fitted values. The correlation between these variables is \(r\)=0.877 hence \(R^2\)=0.768 (shown in the output above). In this case model does an excellent job as it explains 76.8% of the variation in the consumption data. Compare that to the \(R^2\) value of 0.15 obtained from the simple regression with the same data set in Section 7.1. Adding the three extra predictors has allowed a lot more of the variation in the consumption data to be explained.

Standard Error of the Regression

Another measure of how well the model has fitted the data is the standard deviation of the residuals, which is often known as the “residual standard error.” This is shown in the above output with the value 0.31.

It is calculated using:

\(\hat{\sigma_e} = \sqrt {\frac{1}{T - k - 1}\sum_{t = 1}^T e_t^2}\)

where k is the number of predictors in the model. Notice that we divide by T−k−1 because we have estimated k+1 parameters (the intercept and a coefficient for each predictor variable) in computing the residuals.

The standard error is related to the size of the average error that the model produces. We can compare this error to the sample mean of y or with the standard deviation of y

to gain some perspective on the accuracy of the model.

The standard error will be used when generating prediction intervals, discussed in Section 7.6.

7.3 Evaulating the regression model

The differences between the observed y values and the corresponding fitted ^y values are the training-set errors or “residuals” defined as:

\(e_t = y_t - \hat{y_t}\)
\(\text{ } = y_t - \hat{\beta_1}{x_{1,t}} - \hat{\beta_2}{x_{2,t}} - ... - \hat{\beta_k}{x_{k,t}}\)

for \(t = 1,...T.\) Each residual is the unpredictable component of the associated observation.

As a result of these properties, it is clear that the average of the residuals is zero, and that the correlation between the residuals and the observations for the predictor variable is also zero. (This is not necessarily true when the intercept is omitted from the model.)

ACF plot of residuclas

With time series data, it is highly likely that the value of a variable observed in the current time period will be similar to its value in the previous period, or even the period before that, and so on. Therefore when fitting a regression model to time series data, it is common to find autocorrelation in the residuals. In this case, the estimated model violates the assumption of no autocorrelation in the errors, and our forecasts may be inefficient — there is some information left over which should be accounted for in the model in order to obtain better forecasts. The forecasts from a model with autocorrelated errors are still unbiased, and so are not “wrong,” but they will usually have larger prediction intervals than they need to. Therefore we should always look at an ACF plot of the residuals.

Histogram of residuals

It is always a good idea to check whether the residuals are normally distributed. As we explained earlier, this is not essential for forecasting, but it does make the calculation of prediction intervals much easier.

Example

using the gg_tsresiduals() function introduced in Section 5.3, we can obtain all the useful residual diagnostics mentioned above:

fit_consMR %>% gg_tsresiduals()

augment(fit_consMR) %>% 
  features(.innov, ljung_box, lag = 10, dof = 5)
## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 tslm      18.9   0.00204

The time plot shows some changing variation over time, but is otherwise relatively unremarkable. This heteroscedasticity will potentially make the prediction interval coverage inaccurate.

The histogram shows that the residuals seem to be slightly skewed, which may also affect the coverage probability of the prediction intervals.

Residual plot against predictors

We would expect the residuals to be randomly scattered without showing any systematic patterns. A simple and quick way to check this is to examine scatterplots of the residuals against each of the predictor variables. If these scatterplots show a pattern, then the relationship may be nonlinear and the model will need to be modified accordingly. See Section 7.7 for a discussion of nonlinear regression

Example

The residuals from the multiple regression model for forecasting US consumption plotted against each predictor in Figure 7.9 seem to be randomly scattered. Therefore we are satisfied with these in this case.

us_change %>% 
  left_join(residuals(fit_consMR), by = "Quarter") %>% 
  pivot_longer(Income:Unemployment, names_to = "regressor", values_to = "x") %>% 
  ggplot(aes(x = x, y = .resid)) +
  geom_point()+
  facet_wrap(.~ regressor, scales = "free_x") +
  labs(y = "Residuals", x = "")

Residuals against fitted values

A plot of the residuals against the fitted values should also show no pattern. If a pattern is observed, there may be “heteroscedasticity” in the errors which means that the variance of the residuals may not be constant. If this problem occurs, a transformation of the forecast variable such as a logarithm or square root may be required (see Section 3.1).

Example

Continuing the previous example, Figure 7.10 shows the residuals plotted against the fitted values. The random scatter suggests the errors are homoscedastic.

augment(fit_consMR) %>% 
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  labs(x = "Fitted", y = "Residuals")

Spurious regression

More often than not, time series data are “non-stationary”; that is, the values of the time series do not fluctuate around a constant mean or with a constant variance. We will deal with time series stationarity in more detail in Chapter 9, but here we need to address the effect that non-stationary data can have on regression models.

Regressing non-stationary time series can lead to spurious regressions. The output of regressing Australian air passengers on rice production in Guinea is shown in Figure 7.13. High R2

and high residual autocorrelation can be signs of spurious regression. Notice these features in the output below. We discuss the issues surrounding non-stationary data and spurious regressions in more detail in Chapter 10.

Cases of spurious regression might appear to give reasonable short-term forecasts, but they will generally not continue to work into the future.

fit <- aus_airpassengers %>% 
  filter(Year <= 2011) %>% 
  left_join(guinea_rice,by = "Year") %>% 
  model(TSLM(Passengers ~ Production))
report(fit)
## Series: Passengers 
## Model: TSLM 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9448 -1.8917 -0.3272  1.8620 10.4210 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -7.493      1.203  -6.229 2.25e-07 ***
## Production    40.288      1.337  30.135  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.239 on 40 degrees of freedom
## Multiple R-squared: 0.9578,  Adjusted R-squared: 0.9568
## F-statistic: 908.1 on 1 and 40 DF, p-value: < 2.22e-16

Conclusion through section 7.3

7.4 Some useful predictors

Trend

It is common for time series data to be trending. A linear trend can be modelled by simply using \(x_{1,t} = t\) as a predictor, \[y_t = \beta_0 + \beta_1{t} + \epsilon_t\] where \(t = 1,...T.\) A trend variable can be specified in the TSLM() function using the trend() special. In section 7.7 we discuss how we can also model nonlinear trends

Dummy variables

So far, we have assumed that each predictor takes numerical values. But what about when a predictor is a categorical variable taking only two values (e.g., “yes” and “no”)? Such a variable might arise, for example, when forecasting daily sales and you want to take account of whether the day is a public holiday or not. So the predictor takes value “yes” on a public holiday, and “no” otherwise.

This situation can still be handled within the framework of multiple regression models by creating a “dummy variable” which takes value 1 corresponding to “yes” and 0 corresponding to “no.” A dummy variable is also known as an “indicator variable.”

A dummy variable can also be used to account for an outlier in the data. Rather than omit the outlier, a dummy variable removes its effect. In this case, the dummy variable takes value 1 for that observation and 0 everywhere else. An example is the case where a special event has occurred. For example when forecasting tourist arrivals to Brazil, we will need to account for the effect of the Rio de Janeiro summer Olympics in 2016.

If there are more than two categories, then the variable can be coded using several dummy variables (one fewer than the total number of categories). TSLM() will automatically handle this case if you specify a factor variable as a predictor. There is usually no need to manually create the corresponding dummy variables.

Example: Australian quarterly beer production

recent_production <- aus_production %>% 
  filter(year(Quarter) >= 1992)
recent_production %>% 
  autoplot(Beer) +
  labs(y = "Megalitres", title = "Australian quarterly beer production")

We want to forecast the value of future beer production. We can model this data using a regression model with a linear trend and quarterly dummy variables,

\(y_t = \beta_0 + \beta_1{t} + \beta_2{d_{2,t}} + \beta_3{d_{3,t}} + \beta_4{d_{4,t}} + \epsilon_t\)

where \(d_{i,t}\) = 1, if \(t\) is in quarter \(i\), and \(0\) otherwise. The first quarter variable has been omitted, so the coefficients associated with the other quarters are measures of the difference between those quarters and the first quarter.

fit_beer <- recent_production %>% 
  model(TSLM(Beer ~ trend() + season()))
report(fit_beer)
## Series: Beer 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -42.9029  -7.5995  -0.4594   7.9908  21.7895 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   441.80044    3.73353 118.333  < 2e-16 ***
## trend()        -0.34027    0.06657  -5.111 2.73e-06 ***
## season()year2 -34.65973    3.96832  -8.734 9.10e-13 ***
## season()year3 -17.82164    4.02249  -4.430 3.45e-05 ***
## season()year4  72.79641    4.02305  18.095  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.23 on 69 degrees of freedom
## Multiple R-squared: 0.9243,  Adjusted R-squared: 0.9199
## F-statistic: 210.7 on 4 and 69 DF, p-value: < 2.22e-16

Note that trend() and season() are not standard functions; they are “special” functions that work within the TSLM() model formulae.

There is an average downward trend of -0.34 megalitres per quarter. On average, the second quarter has production of 34.7 megalitres lower than the first quarter, the third quarter has production of 17.8 megalitres lower than the first quarter, and the fourth quarter has production of 72.8 megalitres higher than the first quarter.

augment(fit_beer) %>% 
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Beer, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted")) +
  scale_color_manual(
    values = c(Data = "black", Fitted = "#D55E00")
  ) +
  labs(y = "Megaliters",
       title = "Austrailian quarterly beer production") +
  guides(colour = guide_legend(title = "Series"))

augment(fit_beer) %>%
  ggplot(aes(x = Beer, y = .fitted,
             colour = factor(quarter(Quarter)))) +
  geom_point() +
  labs(y = "Fitted", x = "Actual values",
       title = "Australian quarterly beer production") +
  geom_abline(intercept = 0, slope = 1) +
  guides(colour = guide_legend(title = "Quarter"))

Intervention variables

It is often necessary to model interventions that may have affected the variable to be forecast. For example, competitor activity, advertising expenditure, industrial action, and so on, can all have an effect.

When the effect lasts only for one period, we use a “spike” variable. This is a dummy variable that takes value one in the period of the intervention and zero elsewhere. A spike variable is equivalent to a dummy variable for handling an outlier.

Distributed lags

It is often useful to include advertising expenditure as a predictor. However, since the effect of advertising can last beyond the actual campaign, we need to include lagged values of advertising expenditure. Thus, the following predictors may be used. \(x_1\) = advertising for previous month; \(x_2\) =advertising for two months previously; . . . \(x_m\) = advertising for $m% months previously.

\(x_{1,t} = sin{\frac{2\pi t}{m}}, x_{2,t} = cos{\frac{2\pi t}{m}}, x_{3,t} = sin{\frac{4\pi t}{m}},\) \(x_{4,t} = sin{\frac{4\pi t}{m}}, x_{5,t} = cos{\frac{6\pi t}{m}}, x_{6,t} = sin{\frac{6\pi t}{m}},\)

and so on. If we have monthly seasonality, and we use the first 11 of these predictor variables, then we will get exactly the same forecasts as using 11 dummy variables.

With Fourier terms, we often need fewer predictors than with dummy variables, especially when m is large. This makes them useful for weekly data, for example, where m≈52. For short seasonal periods (e.g., quarterly data), there is little advantage in using Fourier terms over seasonal dummy variables.

These Fourier terms are produced using the fourier() function. For example, the Australian beer data can be modelled like this.

fourier_beer <- recent_production %>% 
  model(TSLM(Beer ~ trend() + fourier(K = 2)))
report(fourier_beer)
## Series: Beer 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -42.9029  -7.5995  -0.4594   7.9908  21.7895 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        446.87920    2.87321 155.533  < 2e-16 ***
## trend()             -0.34027    0.06657  -5.111 2.73e-06 ***
## fourier(K = 2)C1_4   8.91082    2.01125   4.430 3.45e-05 ***
## fourier(K = 2)S1_4 -53.72807    2.01125 -26.714  < 2e-16 ***
## fourier(K = 2)C2_4 -13.98958    1.42256  -9.834 9.26e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.23 on 69 degrees of freedom
## Multiple R-squared: 0.9243,  Adjusted R-squared: 0.9199
## F-statistic: 210.7 on 4 and 69 DF, p-value: < 2.22e-16

The K argument to fourier() specifies how many pairs of sin and cos terms to include. The maximum allowed is \(K=\frac{m}{2}\) where \(m\)is the seasonal period. Because we have used the maximum here, the results are identical to those obtained when using seasonal dummy variables.

If only the first two Fourier terms are used (\(x_{1,t} \text{ and } x_{2,t}\)), the seasonal pattern will follow a simple sine wave. A regression model containing Fourier terms is often called a harmonic regression because the successive Fourier terms represent harmonics of the first two Fourier terms.

Conclusion through section 7.4

7.5 Selecting predictors

In this section, we will use a measure of predictive accuracy. Five such measures are introduced in this section. They can be shown using the glance() function, here applied to the model for US consumption:

glance(fit_consMR) %>% 
  select(adj_r_squared, CV, AIC, AICc, BIC)
## # A tibble: 1 × 5
##   adj_r_squared    CV   AIC  AICc   BIC
##           <dbl> <dbl> <dbl> <dbl> <dbl>
## 1         0.763 0.104 -457. -456. -437.

We compare these values against the corresponding values from other models. For the CV, AIC, AICc and BIC measures, we want to find the model with the lowest value; for Adjusted R2, we seek the model with the highest value.

Adjusted R2

Computer output for a regression will always give the R2 value, discussed in Section 7.2. However, it is not a good measure of the predictive ability of a model. It measures how well the model fits the historical data, but not how well the model will forecast future data.

An alternative which is designed to overcome these problems is the adjusted R2:

\(\bar{R}^2 = 1 - (1 - R^2) \frac{T - 1}{T - k -1}\)

where \(T\) is the number of observations and \(k\) is the number of predictors. This is an improvement on \(R^2\), as it will no longer increase with each added predictor. Using this measure, the best model will be the one with the largest value of \(\bar{R}^2\). Maximising \(\bar{R}^2\) is equivalent to minimising the standard error \(\hat{\sigma_e}\) given in Equation 7.3

Maximising \(R^2\) works quite well as a method of selecting predictors, although it does tend to err on the side of selecting too many predictors.

Cross-validation

Time series cross-validation was introduced in Section 5.8 as a general tool for determining the predictive ability of a model. For regression models, it is also possible to use classical leave-one-out cross-validation to selection predictors (Bergmeir et al., 2018). This is faster and makes more efficient use of the data. The procedure uses the following steps:

  1. Remove observation \(t\) from the data set, and fit the model using the remaining data. Then compute the error (\(e_t^* = y_t - \hat{y_t}\)) for the omitted observation. (This is not the same as the residual because the \(t\)th observation was not used in estimating the value of \(\hat{y_t}\).)

  2. Repeat step 1 for \(t\) = 1,…,T.

  3. Compute the MSE from \(e_1^*, ..., e_T^*\). We shall call this the CV

Although this looks like a time-consuming procedure, there are fast methods of calculating CV, so that it takes no longer than fitting one model to the full data set. The equation for computing CV efficiently is given in Section 7.9. Under this criterion, the best model is the one with the smallest value of CV.

Akaike’s Information Criterion (lower is better)

A closely-related method is Akaike’s Information Criterion, which we define as:

\(AIC = T\text{log}(\frac{SSE}{T}) + 2(k + 2)\)

where T is the number of observations used for estimation and k is the number of predictors in the model. Different computer packages use slightly different definitions for the AIC, although they should all lead to the same model being selected. The k+2 part of the equation occurs because there are k+2 parameters in the model: the k coefficients for the predictors, the intercept and the variance of the residuals. The idea here is to penalise the fit of the model (SSE) with the number of parameters that need to be estimated.

Corrected Akaike’s Information Criterion (lower is better)

For small values of T, the AIC tends to select too many predictors, and so a bias-corrected version of the AIC has been developed:

\(AIC_c = AIC + \frac{2(k+2)(k+3)}{T - k - 3}\)

Schwarz’s Bayesian Information Criterion

A related measure is Schwarz’s Bayesian Information Criterion

\(BIC = T\text{log}(\frac{SSE}{T}) + (k + 2)log(T)\)

Which measure should we use?

While R2 is widely used, and has been around longer than the other measures, its tendency to select too many predictor variables makes it less suitable for forecasting.

Many statisticians like to use the BIC because it has the feature that if there is a true underlying model, the BIC will select that model given enough data. However, in reality, there is rarely, if ever, a true underlying model, and even if there was a true underlying model, selecting that model will not necessarily give the best forecasts (because the parameter estimates may not be accurate).

Consequently, we recommend that one of the AICc, AIC, or CV statistics be used, each of which has forecasting as their objective. If the value of T is large enough, they will all lead to the same model. In most of the examples in this book, we use the AICc value to select the forecasting model.

Example: US consumption

In the multiple regression example for forecasting US consumption we considered four predictors. With four predictors, there are 24=16 possible models. Now we can check if all four predictors are actually useful, or whether we can drop one or more of them. All 16 models were fitted and the results are summarised in Table 7.1. A X indicates that the predictor was included in the model, Hence the first row shows the measures of predictive accuracy for a model including all four predictors.

The results have been sorted according to the AICc. Therefore the best models are given at the top of the table, and the worst at the bottom of the table. Table 7.1: All 16 possible models for forecasting US consumption with 4 predictors.

Income Production Savings Unemployment Adj R2 CV AIC AICc BIC
X X X X 0.763 0.104 -456.6 -456.1 -436.9
X X X 0.761 0.105 -455.2 -454.9 -438.7
X X X 0.760 0.104 -454.4 -454.1 -437.9
X X 0.735 0.114 -435.7 -435.5 -422.6
X X X 0.366 0.271 -262.3 -262.0 -245.8
X X X 0.349 0.279 -257.1 -256.8 -240.7
X X 0.345 0.276 -256.9 -256.6 -243.7
X X 0.336 0.282 -254.2 -254.0 -241.0
X X 0.324 0.287 -250.7 -250.5 -237.5
X X 0.311 0.291 -246.9 -246.7 -233.7
X X 0.308 0.293 -246.1 -245.9 -232.9
X 0.276 0.304 -238.1 -238.0 -228.2
X 0.274 0.303 -237.4 -237.3 -227.5
X 0.143 0.356 -204.6 -204.5 -194.7
X 0.061 0.388 -186.5 -186.4 -176.7
0.000 0.409 -175.1 -175.0 -168.5

Best subset regression

Stepwise regression

If there are a large number of predictors, it is not possible to fit all possible models. For example, 40 predictors leads to 40 > 1 trillion possible models! Consequently, a strategy is required to limit the number of models to be explored.

An approach that works quite well is backwards stepwise regression:

• Start with the model containing all potential predictors. • Remove one predictor at a time. Keep the model if it improves the measure of predictive accuracy. • Iterate until no further improvement.

If the number of potential predictors is too large, then the backwards stepwise regression will not work and forward stepwise regression can be used instead. This procedure starts with a model that includes only the intercept. Predictors are added one at a time, and the one that most improves the measure of predictive accuracy is retained in the model. The procedure is repeated until no further improvement can be achieved.

Beware of inference after selecting predictors

We do not discuss statistical inference of the predictors in this book (e.g., looking at p-values associated with each predictor). If you do wish to look at the statistical significance of the predictors, beware that any procedure involving selecting predictors first will invalidate the assumptions behind the p-values. The procedures we recommend for selecting predictors are helpful when the model is used for forecasting; they are not helpful if you wish to study the effect of any predictor on the forecast variable.

Completion through section 7.5

7.6 Forecasting with regression

Recall that the predictio of \(y\) can be obtained using:

\(\hat{y_t} = \hat\beta_0 + \hat\beta_1 x_{1,t} + \hat\beta_2 x_{2,t} + ... + \hat\beta_k x_{k,t}\)

which comprises the estimated coefficients and ignores the error in the regression equation. Plugging in the values of the predictor variables \(x_{1,t},...,x_{k,t}\) for \(t = 1,...,T\) returns the fitted (training set) values of y. What we are interested in here, however, is forecasting future values of \(y\).

Ex-ante versus ex-post forecasts

When using regression models for time series data, we need to distinguish between the different types of forecasts that can be produced, depending on what is assumed to be known when the forecasts are computed.

Ex-ante forecasts are those that are made using only the information that is available in advance. For example, ex-ante forecasts for the percentage change in US consumption for quarters following the end of the sample, should only use information that was available up to and including 2019 Q2. These are genuine forecasts, made in advance using whatever information is available at the time. Therefore in order to generate ex-ante forecasts, the model requires forecasts of the predictors. To obtain these we can use one of the simple methods introduced in Section 5.2 or more sophisticated pure time series approaches that follow in Chapters 8 and 9. Alternatively, forecasts from some other source, such as a government agency, may be available and can be used.

Ex-post forecasts are those that are made using later information on the predictors. For example, ex-post forecasts of consumption may use the actual observations of the predictors, once these have been observed. These are not genuine forecasts, but are useful for studying the behaviour of forecasting models.

The model from which ex-post forecasts are produced should not be estimated using data from the forecast period. That is, ex-post forecasts can assume knowledge of the predictor variables (the x variables), but should not assume knowledge of the data that are to be forecast (the \(y\) variable).

Example: Australian quarterly beer production

Normally, we cannot use actual future values of the predictor variables when producing ex-ante forecasts because their values will not be known in advance. However, the special predictors introduced in Section 7.4 are all known in advance, as they are based on calendar variables (e.g., seasonal dummy variables or public holiday indicators) or deterministic functions of time (e.g. time trend). In such cases, there is no difference between ex-ante and ex-post forecasts.

recent_production <- aus_production %>% 
  filter(year(Quarter) >= 1992)
fit_beer <- recent_production %>% 
  model(TSLM(Beer ~ trend() + season()))
fc_beer <- forecast(fit_beer)
fc_beer %>% 
  autoplot(recent_production) +
  labs(
    title = "Forecasts of beer production using regression",
    y = "Megalitres"
  )

Senario based forecasting

In this setting, the forecaster assumes possible scenarios for the predictor variables that are of interest. For example, a US policy maker may be interested in comparing the predicted change in consumption when there is a constant growth of 1% and 0.5% respectively for income and savings with no change in the employment rate, versus a respective decline of 1% and 0.5%, for each of the four quarters following the end of the sample. The resulting forecasts are calculated below and shown in Figure 7.18. We should note that prediction intervals for scenario based forecasts do not include the uncertainty associated with the future values of the predictor variables. They assume that the values of the predictors are known in advance.

fit_consBest <- us_change %>% 
  model(
    lm = TSLM(Consumption ~ Income + Savings + Unemployment)
  )
future_scenarios <- scenarios(
  Increase = new_data(us_change, 4) %>% 
    mutate(Income = 1, Savings = 0.5, Unemployment = 0),
  Decrease = new_data(us_change, 4) %>% 
    mutate(Income = -1, Savings = -0.5, Unemployment = 0),
  names_to = "Scenario")

fc <- forecast(fit_consBest, new_data = future_scenarios)

us_change %>% 
  autoplot(Consumption) +
  autolayer(fc) +
  labs(title = "US Consumption", y = "% change")

Building a predictive regression model

The great advantage of regression models is that they can be used to capture important relationships between the forecast variable of interest and the predictor variables. A major challenge however, is that in order to generate ex-ante forecasts, the model requires future values of each predictor. If scenario based forecasting is of interest then these models are extremely useful. However, if ex-ante forecasting is the main focus, obtaining forecasts of the predictors can be challenging (in many cases generating forecasts for the predictor variables can be more challenging than forecasting directly the forecast variable without using predictors).

An alternative formulation is to use as predictors their lagged values. Assuming that we are interested in generating a h-step ahead forecast we write

\(y_{t+h} = \beta_0 + \beta_1{x_{1,t}} + ... + \beta_k x_{k,t} + \epsilon_{t + h}\)

for h=1,2…. The predictor set is formed by values of the xs that are observed h time periods prior to observing y. Therefore when the estimated model is projected into the future, i.e., beyond the end of the sample T, all predictor values are available.

Including lagged values of the predictors does not only make the model operational for easily generating forecasts, it also makes it intuitively appealing. For example, the effect of a policy change with the aim of increasing production may not have an instantaneous effect on consumption expenditure. It is most likely that this will happen with a lagging effect. We touched upon this in Section 7.4 when briefly introducing distributed lags as predictors. Several directions for generalising regression models to better incorporate the rich dynamics observed in time series are discussed in Section 10.

Prediction intervals

With each forecast for the change in consumption in Figure 7.18, 95% and 80% prediction intervals are also included. The general formulation of how to calculate prediction intervals for multiple regression models is presented in Section 7.9. As this involves some advanced matrix algebra we present here the case for calculating prediction intervals for a simple regression, where a forecast can be generated using the equation,

\(\hat{y} = \hat{\beta_0} + \beta_1 x\)

Assuming that the regression errors are normally distributed, an approximate 95% prediction interval associated with this forecast is given by:

\(\hat{y} \pm 1.96 \hat{\sigma_e} \sqrt{1 + \frac{1}{T} + \frac{(x - \bar{x}^2)}{(T - 1) S^2_x}}\)

where T is the total number of observations, \(\bar{x}\) is the mean of the observed \(x\) values, \(s_x\) is the standard deviation of the observed \(x\) values and \(\hat\sigma_e\) is the standard error of the regression given by Equation (7.3). Similarly, an 80% prediction interval can be obtained by replacing 1.96 by 1.28. Other prediction intervals can be obtained by replacing the 1.96 with the appropriate value given in Table 5.1. If the fable package is used to obtain prediction intervals, more exact calculations are obtained (especially for small values of T) than what is given by Equation (7.4).

Example

The estimated simple regression line in the US consumption example is:

\(\hat{y_t} = 0.54 + 0.27x_t\)

Assuming that for the next four quarters, personal income will increase by its historical mean value of \(\bar{x}\)=0.73%, consumption is forecast to increase by 0.74% and the corresponding 80% and 95% prediction intervals are [−0.02,1.5] and [−0.42,1.9] respectively (calculated using R). If we assume an extreme increase of 12% in income, then the prediction intervals are considerably wider as shown in Figure 7.19.

fit_cons <- us_change %>% 
  model(TSLM(Consumption ~ Income))
new_cons <- scenarios(
  "Average increase" = new_data(us_change, 4) %>% 
    mutate(Income = mean(us_change$Income)),
  "Extreme increase" = new_data(us_change, 4) %>% 
    mutate(Income = 12),
  names_to = "Scenario"
)

fcast <- forecast(fit_cons, new_cons)

us_change %>% 
  autoplot(Consumption) +
  autolayer(fcast) +
  labs(title = "US consumption", y = "% change")

Completion through section 7.6, I feel very good about the scenario methods and solutions here, they are incredibly strong and useful.

7.7 Nonlinear regression

Although the linear relationship assumed so far in this chapter is often adequate, there are many cases in which a nonlinear functional form is more suitable. To keep things simple in this section we assume that we only have one predictor \(x\).

The simplest way of modelling a nonlinear relationship is to transform the forecast variable \(y\) and/or the predictor variable \(x\) before estimating a regression model. While this provides a non-linear functional form, the model is still linear in the parameters. The most commonly used transformation is the (natural) logarithm (see Section 3.1).

A log-log functional form is specified as:

\(\text{log}{y} = \beta_0 + \beta{1}\text{ log } x + \epsilon\)

n this model, the slope \(\beta_1\) can be interpreted as an elasticity: \(\beta_1\) is the average percentage change in \(y\) resulting from a 1% increase in \(x\). Other useful forms can also be specified. The log-linear form is specified by only transforming the forecast variable and the linear-log form is obtained by transforming the predictor.

Recall that in order to perform a logarithmic transformation to a variable, all of its observed values must be greater than zero. In the case that variable x contains zeros, we use the transformation log(x+1); i.e., we add one to the value of the variable and then take logarithms. This has a similar effect to taking logarithms but avoids the problem of zeros. It also has the neat side-effect of zeros on the original scale remaining zeros on the transformed scale.

There are cases for which simply transforming the data will not be adequate and a more general specification may be required. Then the model we use is:

\(y = f(x) + \epsilon\)

where \(f\) is a nonlinear function. In standard (linear) regressoin, \(f(x) = \beta_0 + \beta_1 x\). In the specification of nonlinear regression that follows, we allow f to be a more flexible nonlinear function of x, compared to simply a logarithmic or other transformation.

One of the simplest specifications is to make \(f\) piecewise linear. That is, we introduce points where the slope of \(f\) can change. These points are called knots. This can be achieved by letting \(x_{1,t} = x\) and introducing variable \(x_{2,t}\) such that

\(x_{2,t} = (x - c)_+ = \begin{cases}{0 \text{ if } x<c}\\{x - c} \text{ if } \geq{c} \end{cases}\)

The notation \((x - c)_+\) means the value x−c if it is positive and 0 otherwise. This forces the slope to bend at point c. Additional bends can be included in the relationship by adding further variables of the above form.

Piecewise linear relationships constructed in this way are a special case of regression splines. In general, a linear regression spline is obtained using:

\(x_1 = x \quad x_2 = (x - c_1)_+ \quad ... \quad x_k = (x - c_{k-1})_+\)

where \(c_1,...c_{k-1}\) are the knots (the points at which the line can bend). Selecting the number of knots \((k−1)\) and where they should be positioned can be difficult and somewhat arbitrary. Some automatic knot selection algorithms are available, but are not widely used.

Forecasting with a nonlinear trend

We can think of this as a nonlinear trend constructed of linear pieces. If the trend bends at time \(\tau\), then it can be specified by simply replacing \(x = t\) and \(c = \tau\) above such that we include the predictors:

\(x_{1,t} = t\) \(x_{2,t} = (t - \tau)_+ = \begin{cases}0 \quad \quad \text{ if } t < \tau \\t - \tau \quad \text{if } t \geq \tau \end{cases}\)

Example: Boston marathon winning times

We will fit some trend models to the Boston marathon winning times for men. First we extract the men’s data and convert the winning times to a numerical value. The course was lengthened (from 24.5 miles to 26.2 miles) in 1924, which led to a jump in the winning times, so we only consider data from that date onwards.

boston_men <- boston_marathon %>% 
  filter(Year >= 1924) %>% 
  filter(Event == "Men's open division") %>% 
  mutate(Minutes = as.numeric(Time)/60) %>% 
  select(Year, Minutes)

boston_men <- as_tsibble(boston_men)

autoplot(boston_men) +
  geom_smooth(method = 'lm') +
  labs(title = "Boston marathon winning times")
## Plot variable not specified, automatically selected `.vars = Minutes`
## `geom_smooth()` using formula 'y ~ x'

boston_men <- boston_marathon %>% 
  filter(Year >= 1924) %>% 
  filter(Event == "Men's open division") %>% 
  mutate(Minutes = as.numeric(Time)/60) %>% 
  select(Year, Minutes)

boston_men <- as_tsibble(boston_men)

aug <- boston_men %>% 
  model(NAIVE(Minutes)) %>% 
  augment()

aug %>% 
  ACF(.innov) %>% 
  autoplot()

boston_men %>% 
  model(MEAN(Minutes)) %>% 
  gg_tsresiduals()

Fitting an exponential trend (equivalent to a log-linear regression) to the data can be achieved by transforming the \(y\) variable so that the model to be fitted is:

\(\text{log } y_t = \beta_0 + \beta_1 t + \epsilon_t\)

The plot of winning times reveals three different periods. There is a lot of volatility in the winning times up to about 1950, with the winning times barely declining. After 1950 there is a clear decrease in times, followed by a flattening out after the 1980s, with the suggestion of an upturn towards the end of the sample. To account for these changes, we specify the years 1950 and 1980 as knots. We should warn here that subjective identification of knots can lead to over-fitting, which can be detrimental to the forecast performance of a model, and should be performed with caution.

fit_trends <- boston_men %>% 
  model(
    linear = TSLM(Minutes ~ trend()),
    exponential = TSLM(log(Minutes) ~ trend()),
    piecewise = TSLM(Minutes ~ trend(knots = c(1950, 1980)))
  )

fc_trends <- fit_trends %>% forecast(h = 10)

boston_men %>% 
  autoplot(Minutes) +
  geom_line(data = fitted(fit_trends),
            aes(y = .fitted, colour = .model)) +
  autolayer(fc_trends, alpha = 0.5, level = 95) +
  labs(y = "Minutes",
       title = "Boston marathon winning times")

Figure 7.21 shows the fitted lines and forecasts from linear, exponential and piecewise linear trends. The best forecasts appear to come from the piecewise linear trend.

Completion through section 7.7!

7.8 Correlation, causation and forecasting

Correlation is NOT causation

It is important not to confuse correlation with causation, or causation with forecasting. A variable x may be useful for forecasting a variable y, but that does not mean x is causing y. It is possible that x is causing y, but it may be that y is causing x

, or that the relationship between them is more complicated than simple causality.

For example, it is possible to model the number of drownings at a beach resort each month with the number of ice-creams sold in the same period. The model can give reasonable forecasts, not because ice-creams cause drownings, but because people eat more ice-creams on hot days when they are also more likely to go swimming. So the two variables (ice-cream sales and drownings) are correlated, but one is not causing the other. They are both caused by a third variable (temperature). This is an example of “confounding” — where an omitted variable causes changes in both the response variable and at least one predictor variables.

We describe a variable that is not included in our forecasting model as a confounder when it influences both the response variable and at least one predictor variable. Confounding makes it difficult to determine what variables are causing changes in other variables, but it does not necessarily make forecasting more difficult.

Similarly, it is possible to forecast if it will rain in the afternoon by observing the number of cyclists on the road in the morning. When there are fewer cyclists than usual, it is more likely to rain later in the day. The model can give reasonable forecasts, not because cyclists prevent rain, but because people are more likely to cycle when the published weather forecast is for a dry day. In this case, there is a causal relationship, but in the opposite direction to our forecasting model. The number of cyclists falls because there is rain forecast. That is, \(y\) (rainfall) is affecting \(x\) (cyclists).

It is important to understand that correlations are useful for forecasting, even when there is no causal relationship between the two variables, or when the causality runs in the opposite direction to the model, or when there is confounding.

However, often a better model is possible if a causal mechanism can be determined. A better model for drownings will probably include temperatures and visitor numbers and exclude ice-cream sales. A good forecasting model for rainfall will not include cyclists, but it will include atmospheric observations from the previous few days.

Forecasting with correlated predictors

When two or more predictors are highly correlated it is always challenging to accurately separate their individual effects. Suppose we are forecasting monthly sales of a company for 2012, using data from 2000–2011. In January 2008, a new competitor came into the market and started taking some market share. At the same time, the economy began to decline. In your forecasting model, you include both competitor activity (measured using advertising time on a local television station) and the health of the economy (measured using GDP). It will not be possible to separate the effects of these two predictors because they are highly correlated.

Having correlated predictors is not really a problem for forecasting, as we can still compute forecasts without needing to separate out the effects of the predictors. However, it becomes a problem with scenario forecasting as the scenarios should take account of the relationships between predictors. It is also a problem if some historical analysis of the contributions of various predictors is required.

Multicollinearity and forecasting

A closely related issue is multicollinearity, which occurs when similar information is provided by two or more of the predictor variables in a multiple regression.

It can occur when two predictors are highly correlated with each other (that is, they have a correlation coefficient close to +1 or -1). In this case, knowing the value of one of the variables tells you a lot about the value of the other variable. Hence, they are providing similar information. For example, foot size can be used to predict height, but including the size of both left and right feet in the same model is not going to make the forecasts any better, although it won’t make them worse either.

Multicollinearity can also occur when a linear combination of predictors is highly correlated with another linear combination of predictors. In this case, knowing the value of the first group of predictors tells you a lot about the value of the second group of predictors. Hence, they are providing similar information.

If there is high correlation (close to but not equal to +1 or -1), then the estimation of the regression coefficients is computationally difficult. In fact, some software (notably Microsoft Excel) may give highly inaccurate estimates of the coefficients. Most reputable statistical software will use algorithms to limit the effect of multicollinearity on the coefficient estimates, but you do need to be careful. The major software packages such as R, SPSS, SAS and Stata all use estimation algorithms to avoid the problem as much as possible.

When multicollinearity is present, the uncertainty associated with individual regression coefficients will be large. This is because they are difficult to estimate. Consequently, statistical tests (e.g., t-tests) on regression coefficients are unreliable. (In forecasting we are rarely interested in such tests.) Also, it will not be possible to make accurate statements about the contribution of each separate predictor to the forecast.

Forecasts will be unreliable if the values of the future predictors are outside the range of the historical values of the predictors. For example, suppose you have fitted a regression model with predictors x1 and x2 which are highly correlated with each other, and suppose that the values of x1 in the training data ranged between 0 and 100. Then forecasts based on x1>100 or x1<0

will be unreliable. It is always a little dangerous when future values of the predictors lie much outside the historical range, but it is especially problematic when multicollinearity is present.

Note that if you are using good statistical software, if you are not interested in the specific contributions of each predictor, and if the future values of your predictor variables are within their historical ranges, there is nothing to worry about — multicollinearity is not a problem except when there is perfect correlation.

7.9 Matrix formulation

Recall that multiple regression model can be written as:

\(y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + ... + \beta_k x_{k,t} + \epsilon_t\)

where \(epsilon_t\) has mean zero and variance \(\sigma^2\). This expresses the relationship between a single value of the forecast vairable and the predictors.

It can be convenient to write this in matrix form where all the values of the forecast variable are given in a single equation. Let \(\textbf{y} = (y_1,...,y_T)', \beta = (\beta_0,...,\beta_k)' and\)

\(\textbf{X} = \begin{bmatrix}1 \quad x_{1,1} \quad x_{2,1} \quad ... \quad x_{k,1}\\1 \quad x_{1,2} \quad x_{2,2}, \quad ... \quad x_{k,2}\\ .\quad \quad . \quad \quad . \quad \quad . \quad \\ .\quad \quad . \quad \quad . \quad \quad . \quad \\.\quad \quad . \quad \quad . \quad \quad . \quad \\1 \quad x_{1,T} \quad x_{2,T} \quad ... \quad x_{k,T} \end{bmatrix}\)

where \({\epsilon}\) has mean 0 and variance \(\sigma^2\textbf{I}\). Note that the \(\textbf{X}\) matrix has T rows reflecting the number of observations and k+1 columns reflecting the intercept which is represented by the column of ones plus the number of predictors.

Least squares estimation

Least squares estimation is performed by minimising the expression \(\epsilon'\epsilon = (\textbf{y} - \textbf{X}\beta)'(\textbf{y} - \textbf{X}\beta}\) It can be shown that this is minimixed when \(\beta\) takes the value:

\(\hat{\beta} = (\textbf{X'X})^{-1}^\textbf{Xy}\)

The residual variance is estimated using:

\(\hat{\sigma}_e^2 = \frac{1}{T - k - 1}(\textbf{y} - \textbf{X}\hat{\beta})' (\textbf{y - X}\beta)\)

Fitted values and cross-validation

The normal equation shows that the fitted values can be calculated using

\(\hat{\textbf{y}} = \textbf{X}\hat{\beta} = {X(X'X)^{-1}}X'y = Hy\)

where \(H = X(X'X)^{-1}X'\) is known as the “hat-matrix” because it is used to compute \(\hat{y}\)

If the diagonal values of H are denoted by \(h_1,...h_T,\) then the cross-validation statistic can be computed using:

\(CV = \frac{1}{T}\sum_{t=1}^t [\frac{e_t}{1 - h_t}]^2\)

where e_t is the residula obtained by fitting the model to all \(T\) observations. Thus, it is not necessary to actually fit \(T\) separater models when computing the CV statistic.

Forecasts and prediction intervals

Let \(x^*\) be a row vector containing the values of the predictors (in the same format as X) for which we want to generate a forecast. Then the forecast is given by:

\(\hat{y} = x*\hat{\beta} = x*(X'X)^{-1}X'y\)

The estimated variance is given by:

\(\hat{\sigma_e}^2[1 + x*(X'X)^{-1}X'y]\)

Completion of chapter 7!